Open science resources from the Tara Pacific expedition across coral reef and surface ocean ecosystems

The Tara Pacific expedition (2016–2018) sampled coral ecosystems around 32 islands in the Pacific Ocean and the ocean surface waters at 249 locations, resulting in the collection of nearly 58 000 samples. The expedition was designed to systematically study warm-water coral reefs and included the collection of corals, fish, plankton, and seawater samples for advanced biogeochemical, molecular, and imaging analysis. Here we provide a complete description of the sampling methodology, and we explain how to explore and access the different datasets generated by the expedition. Environmental context data were obtained from taxonomic registries, gazetteers, almanacs, climatologies, operational biogeochemical models, and satellite observations. The quality of the different environmental measures has been validated not only by various quality control steps, but also through a global analysis allowing the comparison with known environmental large-scale structures. Such publicly released datasets open the perspective to address a wide range of scientific questions.


Methods
Sampling locations. Tara Pacific deployed a standardized sampling and analysis protocol to offer a comparative suite of samples covering the widest environmental envelope while optimizing cruising and sampling time over the 2.5 years of the sampling effort. Protocols and global objectives of the Tara Pacific expedition were previously detailed for coral samples 13 and are detailed here in connection with the sample registry. Similarly, protocols and global objectives for ocean and atmosphere sampling were previously described 14 . Sampling locations (Fig. 1) were targeted to cover the widest range of conditions possible, from temperate latitudes to the equator, and from the low-diversity systems of the eastern Pacific to the highly diverse western Pacific warm pool 17 . The variety of coral reef systems explored includes continental islands, remote volcanic islands and atolls, with varying island sizes and human populations (Table 2). Generally, 3 sites ([S01] to [S03]) per island were selected to conduct the full sampling strategy within 4 days. Occasionally only 2 or up to 5 sites were selected (Table 1).
Sampling coral reef systems. The sampling event sequence and protocols were performed consistently over the whole expedition. Sampling was conducted following the same procedure, approximate timing, and articulated around the same standardized "sampling events" (Fig. 2) which allowed the same collection of samples with a standardized protocol (Table 3). On rare occasions, the timing and protocols were adapted due to sailing conditions and to fit the schedule. Sampling events are characterized by their mode of sampling, which could be either indirectly from Tara's dinghy [ZODIAC] or directly either using scuba-diving ([SCUBA]) or snorkeling ([SNORKEL]). In addition, the sampling device and strategy are included in the sample identifier.
The first set of sampling events (usually in the morning) was mostly devoted to the sampling event [SCUBA-3X10] to sample coral colony fragments. In the meantime, another team pumped underwater, with the [SCUBA-PUMP] to collect coral surrounding water ([CSW]), while the third team snorkeled to capture a total of 10-15 fish using a speargun ([SNORKLE-SPEAR]). A small CTD probe (Castaway CTD) was also deployed from the dinghy down to the reef (generally ~5 to 10 m) to record temperature and conductivity profiles.
The second set of sampling events (usually in the afternoon) was devoted to a survey of coral diversity.
Sampling coral colonies. During this typical sampling event, a total of 30 coral colonies ([C001] to [C030]), including 10 colonies for each of the 3 target species (Pocillopora meandrina, Porites lobata, and Millepora platyphylla), were sampled. Each colony was first photographed ([PHOTO]) using a 20 cm quadrat as a scale and its depth was recorded; about 70 grams of each coral were then collected by mechanical fragmentation using hammer and chisel. Fragments were placed in Ziploc bags labelled by colony ID and brought back to the boat.
Sampling coral surrounding water [SCUBA-PUMP] and [ZODIAC-NISKIN]. Two Pocillopora meandrina coral colonies ([C001] and [C010]) were marked with small buoys, and [CSW] samples were collected as close as possible to the coral colony before the actual [SCUBA-3X10] sampling to avoid contamination of the water samples with fragments or tissues released during the mechanical fragmentation of the coral colony. Water was pumped using a manual membrane pump onboard Tara's dinghy, which was stationary above the coral colony. A scuba diver held clean water tubing next to the colony while the operator onboard the dinghy was
[SNORKLE-SPEAR]. Fish sampling of two target species (Acanthurus triostegus and Zanclus cornutus) was carried out by spear-fishing while snorkeling, for a target number of about 10-15 fish ([F001] to [Fxxx]) depending on the population present. The targets were speared and immediately stored in labelled individual Ziploc bags to avoid contamination between samples and kept inside a floating container to keep them at water temperature.
Sampling sediments and macroalgae [SCUBA-…]. Sediment and macroalgae samples were collected when encountered during the different dives. Sediment samples (i.e. sand, [SSED]) were taken using two 10 mL cryovials near the sampled colony. Macroalgae, ideally brown macroalgae with thallus morphology type arbustive,
Coral biodiversity sampling [SCUBA-SURVEY]. Biodiversity sampling transects were conducted in two depth ranges to sample up to 80 arbitrarily chosen coral colonies ([C041] to [C120]), ideally up to 40 colonies at a depth of 10-16 m and up to 40 colonies at a depth of 2-10 m, with an emphasis on sampling across a diverse range of coral hosts at different depths. Two pictures of each colony sampled were taken ([PHOTO]), and small pieces of 1-3 cm² were sampled using a hammer and a chisel or a bone cutter.
Sampling surface seawater [ZODIAC-NISKIN] and [ZODIAC-PUMP]. In addition to the seawater collected next to coral colonies as explained above, surface ([SRF]) seawater was sampled at 2 m depth using the manual pump onboard the dinghy ([ZODIAC-PUMP]). The [SRF] site was chosen to be as close as possible to the coral colonies sampled in the morning but with enough water depth that the plankton net sample could be taken at 2 m depth and at least 5 m above the seafloor. When the sampling site was shallower than 7 m, a site was chosen where these sampling conditions could be met within 100 m of the [CSW] sampling site. The water collected was treated similarly to the [SCUBA-PUMP] samples, except that 100 L of [SRF] water was collected into two 50 L carboys. The 4 L Nalgene bottles protected from sunlight were also filled with water at 2 m below the dinghy for HPLC filtrations on-board Tara.
Sampling large size plankton. During this surface water pumping, plankton larger than 20 µm were sampled at 2 m below the sea surface using two small diameter bongo plankton nets with 20 µm mesh size, attached to an underwater scooter ([SCUBA-NET-20]) and towed for about 15 min at maximum speed (0.69 ± 0.04 m s⁻¹). The average maximum speed of the net tow was estimated in Taiwan (island 28 site 03) by measuring the time it took the diver, with full gear on and the nets attached, to travel between two buoys separated by a 9 m line held tight and floating with the current, to avoid any impact of the current. The measurement was repeated three times facing the current, three times in the same direction as the current, and five times with the current sideways. Each net was equipped with flowmeters, but the speed of the underwater scooter was insufficient to trigger their rotation; the sampling time was therefore recorded precisely so that the volume filtered could be estimated theoretically. The volume estimated from the flowmeter reading was about 60 times smaller than the volume calculated theoretically, implying that the flow rate was below the level ensuring proper functioning of the flowmeter. Thus, only the theoretical volume will be used in concentration calculations. After 15 minutes of towing, the divers surfaced the two nets and the two cod-ends were sieved through a 2000 µm metallic sieve into a 2 L Nalgene® bottle. The bottle was topped up with 0.2 µm filtered seawater from the same sampling site and kept at ocean temperature in a bucket during transportation to Tara.
To prevent contamination with coral fragments and tissues released during coring, two [CARB] samples of seawater were taken (one at the surface and one close to the coral colony) before coring, using two 500 mL glass stoppered bottles. Grease was applied to the glass stopper before the dive to allow opening under pressure next to the coral colony.
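The equations for the theoretical volume filtered by the [SCUBA-NET-20] tow are not reproduced in this text. A minimal sketch of one common way to estimate it is given below, assuming the volume is the net mouth area multiplied by tow speed and tow duration, with 100% filtration efficiency. The function name and the 0.30 m mouth diameter are hypothetical; only the tow speed (0.69 m s⁻¹), the duration (about 15 min) and the use of two nets come from the description above.

```python
import math


def theoretical_filtered_volume_m3(mouth_diameter_m, tow_speed_m_s, tow_duration_s, n_nets=1):
    """Volume of water (m^3) swept by n_nets circular nets during a tow.

    Assumes every parcel of water crossing the net mouth is filtered
    (100 % filtration efficiency), which is an idealisation.
    """
    if mouth_diameter_m <= 0 or tow_speed_m_s < 0 or tow_duration_s < 0 or n_nets < 1:
        raise ValueError("diameter and net count must be positive; speed and duration non-negative")
    mouth_area_m2 = math.pi * (mouth_diameter_m / 2.0) ** 2
    return n_nets * mouth_area_m2 * tow_speed_m_s * tow_duration_s


# Tow speed and duration are taken from the text; the 0.30 m mouth diameter
# is a placeholder and NOT a value reported for the Tara Pacific nets.
volume = theoretical_filtered_volume_m3(0.30, 0.69, 15 * 60, n_nets=2)
print(f"theoretical volume filtered: {volume:.1f} m^3")
```

Plankton concentrations would then be obtained by dividing counts by this volume, once the actual net dimensions are substituted.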
The diver lowered the closed bottles and opened one at 2 m below the surface and one next to the coral colony. Another seawater sample was taken with a 60 mL HDPE plastic bottle at 2 m depth for subsequent analysis of trace isotopes in relation to the core analysis. Once all seawater was sampled, a 250 mm diameter, 600 rounds per minute corer from Melun Hydraulique was used to collect coral cores ([CORE]). Forty coral skeletal cores (40-150 cm long) were collected from
[CS10] and [CS40] samples, which contain 10 g and 40 g of coral material respectively, were stored in Whirlpak® sample bags, immediately flash frozen in liquid nitrogen, and kept at −20 °C. These samples are intended for subsequent metabolomic analysis for [CS10], and for physiologic/stress biomarkers (symbiont and animal biomasses, antioxidant capacity and protein damage) and telomeric DNA length for [CS40]. Morphological taxonomic identification [CTAX] samples were prepared by drying 5 g of material in 50 mL Falcon tubes and removing organic material with the addition of 3-4% bleach solution for approximately 2 days. After discarding the bleach solution, clean skeletons were preserved dry at room temperature. For histological measurements of reproduction status [CREP], 5 g of each coral colony was preserved in a 50 mL Falcon tube filled with a 3.5% formaldehyde solution and stored at room temperature. Lastly, for transmission electron microscopy examination of coral intracellular details including viruses [CTEM], 0.1 g of coral tissue was preserved with 250 µL of 2% glutaraldehyde and conserved at 4 °C in a fridge.
Macroalgae samples ([MA]), and the seawater collected with them, were firmly shaken to resuspend attached epiphytic organisms. 20 mL of water was transferred into glass vials and fixed with 2% acidic Lugol and stored at 4 °C for future benthic dinoflagellates identification and counts using microscopy ([BDI]), while 100 mL of each replicate were filtered onto a 10 µm pore size polycarbonate filter which was flash frozen and preserved in liquid nitrogen for future metabarcoding analysis ([BDS]).
[SSED] samples were immediately flash frozen when brought back on-board Tara. About 30 to 40 mL of the seawater that was sampled with the coral fragments of [C001] and [C010] and transported in the coral individual Ziplock bags were transferred immediately after the dive into 50 mL falcon tube and stored at water temperature in non-direct ambient light to recover cultures of plankton species closely associated with coral colonies ([IMG-LIVE]).
When fish were recovered onboard, a [PHOTO] was taken and their sex and length were determined before taking a sample of skin mucus ([MUC]) by collecting 1 cm² of skin. The fish were then dissected to recover about 3 cm of the final section of the digestive tract ([GT]), which was preserved in 2 mL cryotubes with 1 mL of DNA/RNA shield and then stored at −20 °C for metagenomic and metabarcoding analyses. One fin sample ([FIN]) was dissected and preserved in an Eppendorf tube filled with 95% ethanol for population genetic analyses. Lastly, the otolith ([OTO]) was also dissected and stored dry in an Eppendorf tube at room temperature for later aging of each fish.
Coral samples obtained from [SCUBA-SURVEY] were collected for symbiont and coral diversity analysis ([CDIV]) using different marker genes (metabarcoding, 18S, 16S and ITS2). About 0.5 g of material was preserved with DNA/RNA shield and stored in 2 mL cryotubes at −20 °C.
Finally, samples collected during [SCUBA-CORER] events were also processed and stored onboard Tara. The [CORE] samples were rinsed with freshwater and air dried for 24-48 h before being wrapped in plastic bubble wrap for sclerochronological and geochemical analysis, to recover historical water biogeochemical properties.
Water samples for biogeochemistry. The [PH] was measured from the two replicate 5 mL polypropylene vials onboard Tara using an Agilent Technologies Cary 60 UV-Vis Spectrophotometer equipped with an optical fiber. The detailed protocol was previously described 14 , but briefly, the 5 mL vials and the 50 mL Falcon tube were kept closed and acclimated to 25 °C for 2-3 h. Absorbance at specific wavelengths was then read before and after the addition of 40 µL meta-Cresol Purple dye to each 5 mL vial. The probe was rinsed between each measurement using the 50 mL Falcon tube containing the same seawater as the 5 mL vial samples. TRIS buffer solutions 18 were measured regularly along the cruise to validate the method and correct for potential drifts of pH of the dye solution.
The Niskin bottles of the morning ([CSW] for [C001] colony) and afternoon ([SRF]), carefully kept closed since sampling on the dinghy, were each used to rinse and fill one 500 mL glass stoppered bottle on Tara. Some grease was applied to the glass stopper, and bottles were filled with water samples leaving 2 mm of air below the bottom of the bottleneck. Note that the [CARB] samples associated with the [CORE] samples were already stored in their final container and grease was already applied to the glass stopper before the dive. The water level of these samples was simply adjusted to 2 mm below the bottleneck. All , while all other samples were taken in duplicates. Additionally, all genomic samples were processed to be as comparable as possible with previous existing samples from Tara Oceans 12,15 .
As soon as it was back on-board Tara, the water collected was used to rinse and fill one (for each [CSW]) or two (for [SRF]) 50 L carboys and two 2 L Nalgene® bottles. The content of the 50 L carboys was immediately size-fractionated by sequential filtration onto 3 µm-pore-size polycarbonate membrane filters and 0.22 µm-pore-size polyethersulfone Express Plus membrane filters. Both were placed on top of a woven mesh spacer Dacron 124 mm (Millipore) and stainless-steel filter holder "tripods" (Millipore). Water was directly pumped from the 50 L carboy with a peristaltic pump (Masterflex) and separated into samples that contain particles from 3-20 µm ([S320]) and 0.2-3 µm ([S023]) for later sequencing. To ensure high-quality RNA, the filtering of the first replicate ([C001] for [CSW] samples and either of the two 50 L carboys for [SRF]) was stopped after 15 minutes of filtration, while the second was continued for the full volume (or a maximum of 60 min) to maximize DNA yield. Filters were folded into 5 mL cryovials and preserved in liquid nitrogen immediately after filtration. During this filtration, 10 L of 0.2 µm filtered water ([S < 02]) was collected from each replicate and 1 mL of FeCl3 solution was added to flocculate viruses 19 for 1 hour. This solution was then filtered again onto a 1 µm-pore-size polycarbonate membrane filter using the same filtration system as for [S320] and [S023]. Filters were then placed in 5 mL cryotubes and stored at 4 °C for later sequencing of viruses. The remaining 80 L of 0.22 µm prefiltered water was used to collect membrane vesicles ([S < 02 > ]) using an ultrafiltration Pellicon2 TFF system, keeping the pressure below 10 psi until the concentrate was reduced to a final volume of 200-300 mL. This sample was further concentrated using a Vivaflow200 TFF system at a recirculation rate of 50-100 mL/min and less than one bar of pressure until a final sample of 20 mL was obtained.
Flushing back the system usually brings this volume up to 40 mL, which was stored in a 50 mL Falcon tube at −20 °C. Two 4 mL samples were taken from the 2 L Nalgene bottles, stored in 5 mL cryotubes, fixed with 600 µL of 48% glycine betaine and directly flash-frozen for later single-cell genomic analysis ([SCG]). For flow cytometry cell counting ([FCM]), two replicates of 1.485 mL of sampled water were placed into 2 mL cryotubes pre-aliquoted with 15 µL of fixative composed of glutaraldehyde (25%) and poloxamer (10%). Tubes were then mixed gently by inversion, incubated for 15 min at room temperature in the dark before being flash-frozen, and kept in liquid nitrogen. For scanning electron microscopy ([SEM]), 500 mL of water was filtered onto a 47 mm, 0.22 µm pore size, polycarbonate filter, placed in a petri slide, dried for two hours at 50 °C and conserved at room temperature. Fluorescence In Situ Hybridization ([FISH]) samples were prepared by adding 225 mL of seawater into a 250 mL plastic vial containing 25 mL of 10x PFA. The samples were incubated at 4 °C before filtration onto two 25 mm, 0.22 µm pore size polycarbonate filters, rinsed with ethanol, placed in petri slides, and dried for 5-10 minutes before being stored at −20 °C.
(2023) 10:324 | https://doi.org/10.1038/s41597-022-01757-w www.nature.com/scientificdata
Samples collected during the [SCUBA-NET-20] were fractionated for sequencing and imaging needs. One litre of the sample collected was filtered onto four 47 mm, 10 µm pore size, polycarbonate membranes (250 mL each). Filters were then placed into 5 mL cryotubes, flash-frozen, and stored in liquid nitrogen for later sequencing ([S20]). 45 mL was subsampled into a 50 ml Falcon tube, fixed with 5 mL of 10% paraformaldehyde and 500 μl of glutaraldehyde 25% EM grade, and stored at 4 °C for future high-throughput confocal microscopy ([H20]; e.g. 20 ). 4 mL was stored in 5 mL cryotubes, fixed with 600 µl of 48% glycine betaine, immediately flash frozen and kept in liquid nitrogen for single cell genomics ([SCG20]). Another sample for single cell sequencing stored in ethanol ([E20]) was done by filtering 100 to 250 mL of the sample onto a 20 µm sieve and re-suspended in EM grade ethanol for 24 h at 4 °C. After incubation, the sample was sieved a second time to remove any trace of seawater, re-suspended with EM grade ethanol into 15 mL falcon tube, and stored at −20 °C. Finally, a 50 mL sample was directly imaged live onboard ([LIVE20]) using a FlowCam 21 Benchtop B2 series equipped with a 4x lens and processed using the auto-image mode.
Oceanic sampling. To obtain both a large scale and local (around coral reef island) environmental characterization, a comprehensive set of physical, chemical and biological properties of the sea surface ecosystem were recorded while cruising. This sampling scheme was framed to be compatible with the previous Tara Ocean expedition measurements 12,15 , but also to provide a continuity with water samples conducted directly on the coral reef. Furthermore, while the biology and ecology of surface ecosystems remain largely unknown, they are an essential component of air-land-sea exchanges and are subjected to numerous hydrological, atmospheric, physical and radiative constraints 22 and are therefore at the frontline of climate change and pollution.
The main goals and general overview of this sampling are already described 14,23 and will be briefly presented here in the context of the different sampling events and samples that were generated. Measurements and samples can be separated into two types: i. local samples originating from a local sampling event, and ii. autonomous high frequency continuous measurements of atmospheric and surface seawater properties (e.g., per minute averages of higher frequency measurements). In the case of the discrete water sampling, the different sampling events were either attributed to a station (noted
Sampling events. Sampling was organized following several successive events, generally at daily frequency, in the morning. Water collection while cruising was carried out by a custom-made underway pumping system nicknamed the [DOLPHIN], connected by a 4 cm diameter reinforced tubing to a large volume industrial peristaltic pump (max flow rate = 3 m³ h⁻¹) on the deck. The system was equipped with a metallic pre-filter of 2 mm mesh size, two debubblers, and a flowmeter to record the volume of water sampled. Unfiltered water was collected first for a series of protocols, then water was prefiltered using a 20 µm sieve to rinse and fill two 50 L carboys. Both unfiltered and 20 µm filtered seawater were labelled as [CARBOY]. To collect larger plankton, water was pumped from the DOLPHIN into a 20 µm net fixed on the wetlab's wall ([DECKNET-20]) for 1 to 2 hours depending on biomass concentration, simultaneously with a net tow using a "high speed net" ([HSN-NET-300]). The HSN was equipped with a 300 µm mesh net and designed to be efficient up to 9 knots. It was towed for 60 to 90 minutes depending on the plankton density. Near islands and in the Great Pacific Garbage Patch, a Manta net ([MANTA-NET-300]) with a 0.16 × 0.6 m mouth opening and a 4 m long net with 300 µm mesh size was used concurrently at a maximum speed of 3 knots.
Finally, trace metal samples ([MTE-USC]) were collected from the bow using a metal-free carbon fibre pole ([HANDHELD-BOW-POLE]) to which a plastic fixture had been added to hold a 125 mL low density polyethylene (LDPE) bottle that had been pre-washed on land and stored individually in a separate Ziploc bag. To avoid contamination from the boat, samples were collected by hand, wearing polyethylene gloves, while cruising upwind from the bow of the boat (i.e., before the boat came into contact with the collected water; Fig. 3).
Samples processing. Water, plankton and aerosol samples collected in the vicinity of islands and from the open sea were processed as much as possible following protocols similar to those used on islands. Samples collected both on islands and in open sea are marked with asterisks* here, and only the few differences in protocols will be noted.
From Dolphin-Decknet. Once the [DECKNET-20] time limit was reached (between 1 and 2 hours), the flow was stopped and the net was carefully rinsed with 0.2 µm filtered seawater. The plankton sample was then transferred to a 2 L Nalgene bottle and completed to 2 L with 0.2 µm filtered seawater. The sample was homogenized by repeated smooth bottle flips and split into four 250 mL subsamples for [S20]*, one 250 mL sample for [E20]*, one 250 mL sample for [LIVE20]*, and one 45 mL sample for [H20]*. In addition to these already described protocols, one 250 mL sample was also taken for [L20], for which the seawater was drained using a 20 µm sieve and the plankton was transferred to a 50 mL Falcon tube and fixed with 1 mL of acidic Lugol solution for later microscopic observations. Finally, a 45 mL sample was taken for [F20], transferred to a 50 mL Falcon tube, fixed with 1 mL of 37% formalin solution and completed to 50 mL with sodium tetraborate decahydrate buffer solution for later microscopic observations.
From HSN/Manta nets. Once recovered, samples collected by both the HSN net and the Manta net followed the same procedure. The net was carefully rinsed from the exterior to drain organisms into the collector. Its content was transferred using 0.2 µm filtered seawater into a 2 L Nalgene bottle and completed to 2 L. The sample was then homogenized and split into two 1 L samples. The first half was prefiltered through a 2 mm metallic sieve and filtered onto four 47 mm, 10 µm pore size polycarbonate membranes (250 mL each). Filters were then placed into 5 mL cryotubes, flash frozen and conserved in liquid nitrogen for later sequencing ([S300]).
The second fraction was concentrated onto a 200 µm sieve and resuspended in a 250 mL double closure bottle using filtered seawater saturated with sodium tetraborate decahydrate, fixed with 30 mL of 37% formalin solution and stored at room temperature for later taxonomic and morphological analysis using imaging methods ([F300]).
using scanning electron microscope. Twice a day (12 h pumping periods), at approximately dusk and dawn, those filters were changed; [AS] and [ABS] filters were placed into 2 mL cryotubes (2 filters for each [ABS] sample) and immediately flash frozen, while [AI] filters were packaged in sterile PetriSlides preloaded with absorbent pads and stored dry at room temperature.
Continuous measurements. As previously described (see 14,23 ), a comprehensive set of sensors was combined to continuously measure several properties of the water as well as atmospheric aerosols and meteorological conditions. All sensors were interfaced with the ship's GPS and synchronized in time (UTC). Surface seawater was pumped continuously through a hull inlet located 1.5 m under the waterline using a membrane pump (10 LPM; Shurflo), circulated through a vortex debubbler and a flow meter, and distributed to a number of flow-through instruments. A thermosalinograph [TSG] (SeaBird Electronics SBE45/SBE38) measured temperature, conductivity, and thus salinity. Salinity measurements were intercalibrated against unfiltered seawater samples [SAL] taken every week from the surface ocean, and corrected for any observed bias. Moreover, temperature and salinity measurements were validated against Argo float data collocated with Tara. A CDOM fluorometer [WSCD] (WETLabs) measured the fluorescence of coloured dissolved organic matter [fdom]. An [ACS] spectrophotometer (WETLabs) measured hyperspectral (4 nm resolution) attenuation and absorption in the visible and near infrared, except between Panama and Tahiti where an AC-9 multispectral spectrophotometer (WETLabs) was used instead. A filter-switch system was installed upstream of the [ACS] to direct the flow through a 0.2 µm filter for 10 minutes every hour before being circulated through the [ 31 . A brief description of the methods used to analyse, calibrate, correct, and estimate bio-optical proxies is given in the section Technical Validation and explained more extensively in each processing report attached to the dataset.
The SMPS was set to perform a full scan of particle distribution every 5 min and the EDM produced a particle size distribution every 60 s. Data provided from [EDM] include the total particle concentration (nb cm⁻³) in the size range 0.25-32 µm every 60 seconds and, in a second dataset averaged every 30 minutes, the particle concentration (nb cm⁻³) together with its normalized size distribution (dN/dlogDp (nb cm⁻³), i.e., the concentration divided by the log of the size width of the bin). Data from [SMPS] are averaged hourly and provided as particle concentration (nb cm⁻³) together with its normalized size distribution (dN/dlogDp (nb cm⁻³)).
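The normalization of a binned particle size distribution described above can be illustrated with a short sketch. It assumes that "the log of the size width of the bin" means the difference of base-10 logarithms of the upper and lower bin diameters, which is the usual convention for dN/dlogDp but is not stated explicitly in the text. The function name and the bin edges in the example are hypothetical.

```python
import math


def normalized_size_distribution(bin_edges_um, concentrations_nb_cm3):
    """Return dN/dlogDp for each size bin.

    bin_edges_um: increasing bin boundaries (len = number of bins + 1).
    concentrations_nb_cm3: particle concentration measured in each bin.
    Assumes dlogDp = log10(upper edge) - log10(lower edge).
    """
    if len(bin_edges_um) != len(concentrations_nb_cm3) + 1:
        raise ValueError("need exactly one more bin edge than concentrations")
    normalized = []
    for lower, upper, conc in zip(bin_edges_um[:-1], bin_edges_um[1:], concentrations_nb_cm3):
        if not 0 < lower < upper:
            raise ValueError("bin edges must be positive and strictly increasing")
        normalized.append(conc / (math.log10(upper) - math.log10(lower)))
    return normalized


# Hypothetical bins spanning the EDM size range (0.25-32 um) with made-up counts.
edges = [0.25, 0.5, 1.0, 2.0, 32.0]
counts = [120.0, 40.0, 8.0, 1.5]
print(normalized_size_distribution(edges, counts))
```

Dividing by the logarithmic bin width makes distributions measured with different bin layouts, such as those of the [EDM] and the [SMPS], directly comparable.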
Together with navigation data such as speed over ground [sog] and course over ground [cog], a meteorological station (BATOS-II, Météo France) measured air temperature, relative humidity, and atmospheric pressure at 7 m above sea level. True and apparent wind speed and direction were measured at about 27 m above sea level. In October 2016 a Photosynthetically Active Radiation [par] sensor (Biospherical Instruments Inc. QCR-2150) was mounted at the stern of the boat (~5 m altitude).

Data records
The full collection of datasets has been deposited either at Pangaea or at Zenodo depending on their nature, but also on the likelihood to be updated.
Provenance metadata. Tara Pacific datasets are articulated around a consistent set of provenance metadata that provide temporal (UTC date and time) and spatial (latitude, longitude, depth or altitude) references as well as annotations about environmental features and place names, using controlled vocabulary from the environmental ontology (https://www.ebi.ac.uk/ols/ontologies/envo) and the marine regions gazetteers (https://www.marineregions.org/). These metadata are available at three granular levels: sampling stations and sites, sampling events, and samples collected at a specific depth.
A [sampling-design-label] is provided to facilitate the identification and integration of data that originate from the same open ocean station (OA###), island (I##), site (S##) or coral colony (C###), and hence share provenance and environmental context. For example, data originating from coral colony number twelve on the second site of the fourth island visited by Tara will bear the sampling design label OA000-I04-S02-C012. Similarly, data collected at station number 99 in the middle of the Pacific Ocean will bear the sampling design label OA099-I00-S00-C000, and data collected at open ocean station number 41 within 200 nautical miles of island number four will bear the sampling design label OA041-I04-S00-C000.
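As an illustration, the sampling design label can be split into its four components with a small parser. This is a minimal sketch: the function and class names are hypothetical, and treating an all-zero field as "not applicable" is an interpretation inferred from the three examples above rather than a documented rule.

```python
import re
from typing import NamedTuple, Optional

# Pattern follows the OA###-I##-S##-C### layout described in the text.
_LABEL_PATTERN = re.compile(r"^OA(\d{3})-I(\d{2})-S(\d{2})-C(\d{3})$")


class SamplingDesign(NamedTuple):
    station: Optional[int]  # open ocean station number, None if OA000
    island: Optional[int]   # island number, None if I00
    site: Optional[int]     # site number, None if S00
    colony: Optional[int]   # coral colony number, None if C000


def parse_sampling_design_label(label: str) -> SamplingDesign:
    """Split a sampling design label into integer fields (zero becomes None)."""
    match = _LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"not a sampling design label: {label!r}")
    # Assumption: an all-zero field means the level does not apply.
    return SamplingDesign(*(int(group) or None for group in match.groups()))


for example in ("OA000-I04-S02-C012", "OA099-I00-S00-C000", "OA041-I04-S00-C000"):
    print(example, "->", parse_sampling_design_label(example))
```

Grouping records on the parsed `island` and `site` fields is one way to bring together samples that share the same provenance and environmental context.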
Each sample is also characterized by its sampling event, which has several properties: the date and time (UTC) of sampling ([sampling-event_date_time-utc]), the type of event from which the sample originates ([sampling-event_device_label]), the material sampled ([sample-material_label]; see Table 3), the protocol used ([sampling-protocol_label]; see Table 3) and the barcode attributed to the final sample obtained and replicated on the logsheets ([sample-storage_container-label]). Finally, each sample, in addition to its original barcode, was characterized by an event label and a sample label composed of that previous information such as:
The provenance context of all samples collected during the Tara Pacific Expedition is available as a single UTF-8 encoded tab-separated-values file, in open access at Zenodo and replicated in part at BioSamples (XYZ). In addition to georeferences and place names, the provenance metadata includes sample unique identifiers, taxonomic annotation from NCBI, and links to sampling logsheets and campaign summary reports.
Additionally, the full repository containing the campaign summary reports, sampling authorisations, logsheets and the full record of coral images can be consulted on Pangaea (https://store.pangaea.de/Projects/TARA-PACIFIC/). The full list of sampling events is available in the following dataset 32 : https://doi.org/10.1594/PANGAEA.944548.
Environmental context for data analysis. A rich collection of environmental parameters, collected from samples, on-board measurements, satellite imagery and operational models, or calculated from astronomical almanacs, was compiled and made available for further analysis. These environmental measurements are provided in a multi-layered way in open access at either Pangaea or Zenodo (Tables 4 and 5).

Combined version at the event level.
A compilation of all environmental measures obtained during a given sampling event was produced by combining the boat's sensor data available during the time span of the station with measurements originating from satellite imagery (MODIS-AQUA satellite, Level 3 mapped product, 8-day average, 4 km resolution) recovered using the OPeNDAP protocol at https://oceandata.sci.gsfc.nasa.gov. The zone corresponding to the station position and date was recovered by taking a two-pixel buffer around the given location (total zone being a 5 by 5 pixel square of 20 km side). To provide an alternative measure in the inevitable case where clouds were present, a 12-pixel buffer was also taken (total zone being a 25 by 25 pixel square of 100 km side).
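The pixel-buffer extraction can be sketched as follows, assuming the satellite product has already been loaded as a two-dimensional grid in which cloud-masked pixels are stored as NaN. The function names and the toy grid are hypothetical, and averaging the valid pixels is one plausible way to summarize the window; the text does not state which statistic was retained.

```python
import math


def window_side_km(buffer_px, pixel_km=4.0):
    """Side length of the square window for a given buffer (4 km pixels by default)."""
    return (2 * buffer_px + 1) * pixel_km


def window_mean(grid, row, col, buffer_px):
    """Mean of the valid (non-NaN) pixels within buffer_px of (row, col).

    The window is clipped at the grid edges; NaN is returned when every
    pixel in the window is masked (e.g. by clouds).
    """
    valid = []
    for r in range(max(0, row - buffer_px), min(len(grid), row + buffer_px + 1)):
        for c in range(max(0, col - buffer_px), min(len(grid[r]), col + buffer_px + 1)):
            value = grid[r][c]
            if not math.isnan(value):
                valid.append(value)
    return sum(valid) / len(valid) if valid else math.nan


# Toy 30 x 30 grid: clear sky everywhere except a 7 x 7 cloud over the station.
grid = [[0.2 for _ in range(30)] for _ in range(30)]
for r in range(12, 19):
    for c in range(12, 19):
        grid[r][c] = math.nan

print(window_side_km(2), window_mean(grid, 15, 15, 2))    # small window, fully clouded
print(window_side_km(12), window_mean(grid, 15, 15, 12))  # large window still has data
```

In this toy case the 5 by 5 window is entirely masked while the 25 by 25 window still yields a value, which is the situation the larger buffer is meant to cover.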
The corresponding variables recovered are chlorophyll a 38  This compilation of environmental data at the scale of the event was further enriched using data from reanalysed (i.e. forced with observations) operational models obtained from Copernicus Marine Services (GLOBAL_REANALYSIS_PHY_001_030 46 , daily mean for sea surface height, salinity, temperature, current speeds and mixed layer depth; GLOBAL_REANALYSIS_BIO_001_029 47 , daily mean for Chl a, phytoplankton carbon, O2, NO3, PO4, SiOH and Fe concentrations, primary production, pH and CO2 partial pressure; and GLOBAL_REANALYSIS_WAV_001_032-TDS 48 for sea surface waves), and using almanacs 49,50 to calculate essential sun and moon parameters (position, rises and sets, phase, etc.).
Environmental context at the granularity of samples. The environmental context of all samples collected during the Tara Pacific Expedition is available together with the provenance file in open access at Zenodo. The environmental context of each sample is provided based on environmental data sets described above for continuous and discrete measurements, as well as those generated from almanacs, satellite imagery and models.
Environmental context is provided in eleven UTF-8 encoded tab-separated-values files, all with the same structure, but each providing a different statistic: number of values (n), mean value (mean), standard deviation (stdev), 5th, 25th, 50th, 75th and 95th percentiles (P05, P25, P50, P75, P95), lag in time (dt), i.e. difference between the collection date/time of the sample and that of the environmental context provided, lag in horizontal space (dxy), i.e. distance between the collection location of the sample and that of the environmental context provided, and lag in vertical space (dz), i.e. difference between the collection depth/altitude of the sample and that of the environmental context provided.
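Because the eleven files share the same structure, the lag files can be used to mask values of the mean file that are too distant in time or space. The sketch below is only an illustration: the parameter names (`sst`, `chl`, `no3`), the tolerances and the units of the lags are hypothetical, each row is represented as a dictionary of raw text cells, and the four missing value terms used in the files are treated as "no value".

```python
# Missing value terms used in the files.
MISSING = {"nav", "npr", "nac", "nap"}

def to_float(cell):
    """Convert a raw text cell to float, or None for a missing value term."""
    return None if cell in MISSING or cell == "" else float(cell)

def mask_by_lag(mean_row, dt_row, dxy_row, max_dt, max_dxy):
    """Keep a mean value only if both its time lag and its horizontal lag are within tolerance."""
    kept = {}
    for name, cell in mean_row.items():
        value = to_float(cell)
        dt = to_float(dt_row.get(name, "nav"))
        dxy = to_float(dxy_row.get(name, "nav"))
        ok = (value is not None and dt is not None and dxy is not None
              and abs(dt) <= max_dt and dxy <= max_dxy)
        kept[name] = value if ok else None
    return kept

# Hypothetical rows for one sample, as read from the mean, dt and dxy files.
mean_row = {"sst": "27.4", "chl": "0.12", "no3": "nav"}
dt_row = {"sst": "0.5", "chl": "6.0", "no3": "nap"}
dxy_row = {"sst": "3.0", "chl": "2.0", "no3": "nap"}
filtered = mask_by_lag(mean_row, dt_row, dxy_row, max_dt=1.0, max_dxy=10.0)
```

The tolerances should be chosen according to the scientific question; the same pattern extends to the vertical lag (dz).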
Missing value terms are: "nav" = not-available, i.e. the expected information is not given because it has not been collected or generated; "npr" = not-provided, i.e. the expected information has been collected or generated but it is not given, i.e. a value may be available in a later version or may be obtained by contacting the data providers; "nac" = confidential, i.e. the expected information has been collected or generated but is not available openly because of privacy concerns; "nap" = not-applicable, i.e. no information is expected for this combination of parameter, environment and/or method, e.g. depth below seabed cannot be informed for a sample collected in the water or the atmosphere.
Simplified version at site level. In some cases, certain parameters were not available at specific sampling sites due to technical issues or sensor availability. However, various basin-scale studies and statistical tests require a complete dataset for all sampled sites. During the Tara Pacific expedition, many parameters were concurrently measured in-situ, estimated from remote sensing and/or modelled. For instance, sea surface temperature was measured on the boat using the thermosalinograph included in the underway system, but was also measured by satellite and estimated from a model. Each of these three modes of acquisition has its own caveats and accuracy; however, within a certain confidence interval, missing in-situ data can be replaced by their remotely sensed or modelled equivalent. We provide here a simplified version at the sampling site level by replacing missing in-situ data by their closest and most accurate satellite or modelled equivalent. In each case, in-situ data were considered the most accurate source of data, with a preference for HPLC pigment analysis followed by measurements made by the ACS, while satellite and modelled data were used only if in-situ data were not available.
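The priority rule described above (in-situ first, then satellite, then model) amounts to taking the first available value in a fixed order. A minimal sketch for sea surface temperature follows; the source keys and site names are hypothetical and the values are made up for illustration.

```python
# Priority order described in the text for SST: in-situ, then satellite, then model.
SST_PRIORITY = ("in_situ", "satellite", "model")

def merge_variable(record, priority):
    """Return (value, source) for the highest-priority source that has a value."""
    for source in priority:
        value = record.get(source)
        if value is not None:
            return value, source
    return None, None

# Hypothetical per-site records; None marks a missing measurement.
sites = {
    "site_A": {"in_situ": 28.1, "satellite": 28.3, "model": 27.9},
    "site_B": {"in_situ": None, "satellite": 26.7, "model": 26.5},
    "site_C": {"in_situ": None, "satellite": None, "model": 25.2},
}
merged = {name: merge_variable(rec, SST_PRIORITY) for name, rec in sites.items()}
```

Keeping the source name next to each merged value lets users check afterwards how much of a merged variable comes from each mode of acquisition.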
We evaluated the accuracy of the ACS and of each satellite and modelled dataset by linear regression against their in-situ counterparts. A bias in the modelled or satellite data was identified when the slope of the regression differed from 1 and/or the intercept differed from 0. The satellite and modelled data were then forced to match the in-situ data by dividing by the slope and subtracting the intercept. This is the case for SST. When a large bias persisted between matchups with observations, the corrected data were not used to replace missing in-situ data. This is the case for chl. The same approach was then applied to fill missing data with modelled values (MERCATOR-Copernicus).
A bias correction was applied to the following variables: SST, SSS, PO4, and SiOH. As above, if a large bias persisted between observations and corrected data, the corrected data were not used to replace missing in-situ data. This is the case for chl, NO3, and Fe.
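As an illustration of the regression-based correction, the sketch below fits an ordinary least-squares line and inverts it. It assumes that the satellite (or modelled) values are regressed on the in-situ values, so that the correction is (value - intercept) / slope; this convention, the function names and the synthetic numbers are assumptions for illustration only.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def correct_bias(values, slope, intercept):
    """Invert the fitted line so that remote values match the in-situ scale."""
    return [(v - intercept) / slope for v in values]

# Synthetic matchups: the satellite reads 10 % too high with a 0.5 degree offset.
in_situ = [20.0, 22.0, 24.0, 26.0, 28.0]
satellite = [1.1 * t + 0.5 for t in in_situ]
slope, intercept = fit_line(in_situ, satellite)
corrected = correct_bias(satellite, slope, intercept)
```

After correction, the matchups can be regressed again; if the residual bias remains large, the corrected values would not be used for gap filling, as described above for chl, NO3 and Fe.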
The [MTE] samples were sometimes collected in the afternoon rather than in the morning with all the other water samples, and were thus located between two sampling stations. These [MTE] samples could not be assigned to a sampling station following the criterion presented in section 3; therefore, the missing values of the corresponding morning stations were interpolated linearly.
The same approach was used for pH measurements, with a preference for measurements provided by total carbonate system quantifications, followed by direct pH measurements and then modelled values (MERCATOR-Copernicus).
Lagrangian and Eulerian diagnostics. In order to provide a description of the dynamical properties of the water masses sampled, different Eulerian and Lagrangian diagnostics were calculated. Here, we report a general description of the information each of them provides. In the next subsection, we provide the details of how they were calculated for each station.
Table 5. Data sets providing the provenance and the environmental context for future analysis, aggregated at the sample, event and site levels.
in that they receive waters coming from different origins, and that are then spread over several different destinations. These can represent possible hotspots driving biodiversity 51 .
Lagrangian Divergence 53 ([LagrDiverg], d⁻¹). This diagnostic was calculated by integrating the Eulerian divergence along the backward trajectories. If positive, it indicates a water mass that, during the previous days, was subjected to a strong divergence, thus to a possible upwelling. If negative, it indicates a strong convergence, thus possible downwelling.
Retention Time 54 ([RetentionTime], d). This diagnostic indicates how many days a water mass has spent inside an eddy in the previous period. If the water mass is outside an eddy, then its retention time is set to zero.
Extraction of the Eulerian and Lagrangian diagnostics. For each of the 246 stations sampled, we proceeded as follows.
We identified the water mass sampled at the given station. This was considered as a stadium shape with the two semi-circles centered on the starting and ending points of the transect, respectively. The radius of the stadium semi-circles was set to 0.1°, in accordance with previous studies 51,55,56 . The stadium was filled with virtual particles separated by 0.01°.
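The stadium construction can be sketched as follows: a point belongs to the stadium when its distance to the transect segment does not exceed the radius. The sketch treats longitude and latitude as planar coordinates in degrees, which is an assumption made here for simplicity, and the function names are hypothetical.

```python
import math

def dist_to_segment(p, a, b):
    """Planar distance (in degrees) from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg2 = dx * dx + dy * dy
    # Position of the closest point along the segment, clamped to [0, 1].
    t = 0.0 if seg2 == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def stadium_particles(start, end, radius=0.1, spacing=0.01):
    """Regular (lon, lat) grid of virtual particles lying inside the stadium."""
    lon_min = min(start[0], end[0]) - radius
    lat_min = min(start[1], end[1]) - radius
    n_lon = round((max(start[0], end[0]) + radius - lon_min) / spacing) + 1
    n_lat = round((max(start[1], end[1]) + radius - lat_min) / spacing) + 1
    particles = []
    for i in range(n_lon):
        for j in range(n_lat):
            p = (lon_min + i * spacing, lat_min + j * spacing)
            # Small tolerance so that points exactly on the boundary are kept.
            if dist_to_segment(p, start, end) <= radius + 1e-9:
                particles.append(p)
    return particles

# Hypothetical transect running 0.2 degrees eastward.
particles = stadium_particles((-150.0, -17.5), (-149.8, -17.5))
```

When the starting and ending points coincide, the stadium reduces to a disc of radius 0.1°.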
For each virtual particle inside the stadium shape, we calculated a Eulerian or Lagrangian diagnostic (described above). The Eulerian diagnostics were extracted directly from the velocity field of the day of sampling. The Lagrangian diagnostics were obtained by advecting the virtual particle backward in time for an amount of time τ from the day of sampling day_S. For the Lagrangian betweenness, the advection was performed between day_S + τ/2 and day_S − τ/2, so that the advective time window was centered on the sampling day (details in 51 ).
For the Lagrangian diagnostics, we used the following advective times τ: 5, 10, 15, 20, 30, and 60 days. The only exception is the retention time, which, by construction, was calculated only with the largest advective time, namely τ = 60 days.
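The advective windows described above can be written down explicitly. The sketch below only computes the start and end of each window for a given sampling day; the diagnostic keys are hypothetical labels and the advection itself is not shown.

```python
from datetime import datetime, timedelta

ADVECTIVE_TIMES = (5, 10, 15, 20, 30, 60)  # days

def backward_window(day_s, tau):
    """Window used for backward advection: from day_S - tau to day_S."""
    return day_s - timedelta(days=tau), day_s

def centered_window(day_s, tau):
    """Window used for the Lagrangian betweenness: centered on day_S."""
    half = timedelta(days=tau / 2)
    return day_s - half, day_s + half

def advective_times(diagnostic):
    """Retention time uses only the largest tau; other diagnostics use all of them."""
    return (max(ADVECTIVE_TIMES),) if diagnostic == "retention_time" else ADVECTIVE_TIMES

# Hypothetical sampling day.
day_s = datetime(2017, 3, 1)
windows = {tau: backward_window(day_s, tau) for tau in advective_times("lagr_diverg")}
```

`datetime` is used rather than `date` so that half-day offsets (for odd values of τ) are preserved in the centered window.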
Once a given diagnostic (Eulerian or Lagrangian) had been calculated for all the virtual particles filling the stadium shape, we calculated its mean value and its 25th, 50th, and 75th percentiles. The percentiles were calculated in order to quantify the spatial variation of the diagnostic inside the stadium shape. Each station was therefore associated with four values (mean, 25th, 50th, and 75th percentiles) of a given diagnostic.
Furthermore, two different velocity fields were used, which are described as follows.
Velocity fields and trajectory calculation. Both velocity fields were downloaded from the E.U. Copernicus Marine Environment Monitoring Service (CMEMS, http://marine.copernicus.eu/). The first velocity field used was MULTIOBS_GLO_PHY_REP_015_004 57 [GlobEkmanDt]. This was produced by combining the altimetry-derived geostrophic velocities and modelled Ekman surface currents. It had a spatial resolution of 0.25° and a temporal resolution of one day. The second velocity field was GLOBAL_REANALYSIS_PHY_001_030 46 [GloryS12]. It was obtained from a NEMO model assimilating altimetry and other observations. It had a spatial resolution of 1/12° and a temporal resolution of one day.
Historical climate data and indices for climate variability for coral collection sites. It is becoming increasingly clear that stress resilience, in particular thermal tolerance, is shaped not only by maximum monthly mean temperatures (MMMs), but also by long-term and short-term climate variability, even at the scale of reefs 58-60 . In order to provide an overview of past climate variability and marine heatwaves experienced by corals sampled at each site, we built a high-resolution historical dataset that spans from 2002 to each site's sampling date. Ocean skin temperature (11 and 12 µm spectral bands, longwave algorithm) was extracted from 1 km resolution level-2 MODIS-Aqua and MODIS-Terra from 2002 to the sampling date and from level-2 VIIRS-SNPP from 2012 to the sampling date. Day and night overpasses were used to maximize data recovery. Following recommendations from NASA Ocean Color (OB.DAAC), only SST products of quality 0 and 1 were used. For each scene, the 9 pixels closest to the sampling site were extracted. All the extracted pixels from the 3 platforms were then averaged daily to obtain time series of daily SST averages and standard deviations for each sampling site, from 2002 to the sampling date.
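The daily aggregation described above can be sketched as a grouping operation: pixels from all platforms are pooled per day after discarding those whose quality level is above 1. The tuple layout of the input and the function name are assumptions made for this illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

def daily_sst(pixels, max_quality=1):
    """Pool pixels from all platforms by day and summarise them.

    pixels: iterable of (day, sst, quality) tuples.
    Returns {day: (mean, standard deviation or None, number of pixels)}.
    """
    by_day = defaultdict(list)
    for day, sst, quality in pixels:
        if quality <= max_quality:  # keep quality levels 0 and 1 only
            by_day[day].append(sst)
    return {
        day: (mean(values), stdev(values) if len(values) > 1 else None, len(values))
        for day, values in sorted(by_day.items())
    }

# Hypothetical extracted pixels (day, SST in degrees C, quality level).
pixels = [
    ("2016-07-01", 28.0, 0),
    ("2016-07-01", 28.4, 1),
    ("2016-07-01", 25.0, 3),  # rejected: quality level above 1
    ("2016-07-02", 27.9, 0),
]
series = daily_sst(pixels)
```

Days without any valid pixel are simply absent from the result, which corresponds to the gaps that are later filled by linear interpolation.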
Each time series was first averaged on a Julian day basis to provide a seasonal average. This yearly seasonal average was triplicated and concatenated into a 3-year seasonal cycle so that a digital low-pass filter could be applied to the middle year without generating edge artefacts. A digital low-pass filter (filter order 3, pass band ripple 0.1; "filtfilt" function in MATLAB) with a 36 Julian day window was applied to the concatenated time series to remove high-frequency noise. The middle year was then extracted from the concatenated time series to recover the seasonal cycle. The sea surface temperature anomaly was calculated as the SST minus the seasonal cycle over the full time series. Considering the short periods of missing data (mean of the 95th percentile of the duration of consecutive days with missing data: 9.8 ± 4.1 days), the missing values in the SST and SST anomaly time series were linearly interpolated in order to calculate thermal stress indices. The SST anomaly frequency was calculated as the number of days over the past 52 weeks when the SST anomaly is greater than or equal to 1 °C. Thermal stress indices relevant to coral reef health were then calculated using the methodology developed for the Coral Reef Temperature Anomaly Database (CoRTAD) 60 (Table 6). Events of cold temperature accumulation have also been reported to cause bleaching and mortality 61,62 ; therefore, the same set of indices was calculated for cold stress by adapting the CoRTAD method, but using the minimum weekly climatologies (Table 6). In addition, we checked for previous occurrences of bleaching events at sampled reef sites by matching island coordinates to the Reef Check dataset (reefcheck.org) obtained from Sully et al. 58,63 . For each Tara Pacific island coordinate, we determined the closest Reef Check site (in terms of distance in km) and considered only Reef Check data within a 10 km radius.
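The sequence of steps above (day-of-year climatology, smoothing on a triplicated year, anomaly, anomaly frequency) can be sketched with the standard library. Note that the smoothing below is a simple centred moving average used as a stand-in for the order-3 low-pass filter applied with filtfilt; it is not the filter used for the dataset. The sketch also assumes a 365-day year and a gap-free series.

```python
import math

def climatology(doys, ssts, n_days=365):
    """Mean SST per day of year (1..n_days); assumes each day of year is observed at least once."""
    sums, counts = [0.0] * n_days, [0] * n_days
    for doy, sst in zip(doys, ssts):
        sums[doy - 1] += sst
        counts[doy - 1] += 1
    return [s / c for s, c in zip(sums, counts)]

def seasonal_cycle(clim, window=36):
    """Smooth the climatology on a triplicated series and return the middle year.

    Stand-in smoother: centred moving average over 2 * (window // 2) + 1 days.
    """
    n, half = len(clim), window // 2
    tripled = clim * 3
    return [sum(tripled[n + d - half:n + d + half + 1]) / (2 * half + 1) for d in range(n)]

def anomaly_frequency(anomalies, threshold=1.0, window_days=364):
    """For each day, number of days in the trailing 52 weeks with anomaly >= threshold."""
    return [
        sum(1 for a in anomalies[max(0, i - window_days + 1):i + 1] if a >= threshold)
        for i in range(len(anomalies))
    ]

# Synthetic two-year daily series with a 10-day warm event in the second year.
doys = [d % 365 + 1 for d in range(730)]
ssts = [27.0 + 2.0 * math.sin(2 * math.pi * d / 365) for d in range(730)]
for d in range(500, 510):
    ssts[d] += 3.0
cycle = seasonal_cycle(climatology(doys, ssts))
anomalies = [sst - cycle[doy - 1] for doy, sst in zip(doys, ssts)]
frequency = anomaly_frequency(anomalies)
```

Triplicating the climatology makes the smoothing periodic, so the first and last days of the year are treated like any other day.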
A condensed table containing single values associated with each sampling site was created by extracting the minimum, maximum, sum, average, standard deviation, and value recorded on the sampling day for each of these indices (detailed in the readme file provided with the dataset). Additional metrics of the last heating and cooling events, as well as the time of recovery, are also provided to represent the state of thermal stress on the day of sampling.
Coral photographic resources and annotations. The [PHOTO] resource consists of two datasets. The first, obtained from the [SCUBA-3X10] protocol, was annotated for genus validation, gross morphological characteristics of the colony, algal contact, presence of boring organisms, sediment contact, predation, and health factors (such as presence of disease and coloration). The acquisition protocol of these annotations is described below. This dataset is also used for the description of morphotypes within each genus for taxonomic annotation in combination with genetic data. The second dataset, obtained following the [SCUBA-SURVEY] protocol, was used for the taxonomic annotation (as close to genus level as possible) of the coral host of the [CDIV] samples. Of a total of 2,470 [CDIV] samples, 1,711 samples had one or more pictures associated (3,085 pictures in total) and 759 samples had no photos. Overall, 11,460 coral photographs were generated and annotated, allowing for a permanent record of all colonies sampled. All [PHOTO] were transferred to EcoTaxa 64 .
(1) Manual annotations of in situ colony (CO) photos: Photo analysis for the genus validation and environmental context was conducted using Matlab with code developed and written specifically for the Tara Pacific Expedition 65 . Photos were annotated individually, and annotations were conducted from January to April 2020. To prevent observer bias, photos were randomized, and the annotator was blind to any information regarding the location or the sampling site. The analysis included 1) identification to the genus level, 2) algal contact with types of algal genus if identifiable (Halimeda, Turbinaria, Dictyota, Lobophora, Crustose Coralline Algae (CCA), Sargassum, Galaxaura, other), 3) presence of boring organisms with types if identifiable (Bivalve, Spirobranchus, Tridacna, Urchin, Other Polychaete, Sponge, and Other), 5) contact with sediment (sand), 6) presence of predation marks. Most annotations were boolean (yes/no), with identifications added if possible. Several indicators of coral health were also annotated, such as whether the coral looked unhealthy or showed tissue loss (Yes/No), coloration (light, normal, dark, or bleached), and presence of a pigmentation response (Yes/No). If a pigmentation response was present, the annotator was prompted to determine if it was trematodiasis (Yes/No). Finally, additional notes included but were not limited to the quality of the photo (blurry, poor visibility, coloration), contact with neighbouring hard or soft coral colonies, fish presence in the photograph, snail(s) or hermit crab(s) on the coral, an object in the photograph, etc.
(2) Taxonomic annotations of coral diversity (CDIV) surveys: All images imported in EcoTaxa have been identified at the genus level by taxonomic experts, and crosslinked with genomic identification from metabarcoding based on the V9 region of the 18S rDNA. Analysis of the 18S marker aimed to generate coral host taxonomic annotations to the level of genus for every sample. The annotation was generated based on each sample's most abundant 18S sequence by aligning it to the NCBI 'nt' database with taxonomic labels. A 'lowest common ancestor' approach was used when there were multiple best hits. These alignment-based annotations were verified phylogenetically (i.e. taxonomic similarity agreed with sequence similarity). More than half of the samples could not be annotated at genus level or better using this approach, due to the lack of resolution of the 18S V9 marker. Where available, host taxonomic assignments were based on photo annotations. Otherwise, 18S-based annotations were used.
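The two decision rules described above, the 'lowest common ancestor' for multiple best hits and the preference for photo annotations over 18S annotations, can be sketched as follows. Lineages are represented as root-to-tip lists of taxon names; this representation and the function names are assumptions for illustration, and the alignment step itself is not shown.

```python
def lowest_common_ancestor(lineages):
    """Deepest taxon shared by all lineages (each a root-to-tip list of names)."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return common[-1] if common else None

def host_annotation(photo_genus, best_hit_lineages):
    """Prefer the photo-based genus; otherwise fall back to the 18S-based annotation."""
    if photo_genus:
        return photo_genus, "photo"
    return lowest_common_ancestor(best_hit_lineages), "18S"

# Two equally good hits that agree down to the family level only.
hits = [
    ["Cnidaria", "Anthozoa", "Scleractinia", "Pocilloporidae", "Pocillopora"],
    ["Cnidaria", "Anthozoa", "Scleractinia", "Pocilloporidae", "Stylophora"],
]
from_18s = host_annotation(None, hits)
from_photo = host_annotation("Pocillopora", hits)
```

In this example the 18S marker alone resolves the host only to family level, which illustrates why photo annotations were preferred when available.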

Technical Validation
Numerous quality control steps were carried out at different levels of acquisition to ensure the good quality of the different datasets. These steps vary depending on the type of measurement and on whether it originates from on-board sensors or from samples.
Inline measurements, models, and satellite data validity. [PAR] measurement validity was checked by first removing physically wrong data (i.e. values greater than 0.45 μE cm⁻² s⁻¹ or lower than 0 μE cm⁻² s⁻¹) and then comparing the remaining data with clear-sky matchup measurements from MODIS-Aqua and MODIS-Terra. The comparison confirmed the good agreement between datasets as well as the absence of sensor drift. Temperature and salinity were acquired by the [TSG]. The quality of the whole time series was manually checked, and the temperature validity was assessed by comparing the temperature readings of the two sensors placed at two different places along the inline system. Potential drift of the temperature sensor was investigated by comparing the temperature time series with satellite sea surface temperature. Salinity measurements were intercalibrated against unfiltered seawater samples [SAL] taken every week from the surface ocean, and corrected for any observed bias. Moreover, temperature and salinity measurements were validated against Argo float data collocated with Tara. The [ACS] absorption and attenuation signal due to dissolved matter, drift, and biofouling was estimated between two filter events by interpolating filtered water absorption and attenuation following the shape of the [fdom] from the [WSCD], when available. This method improves data quality in cases of strong variation in dissolved matter absorption that the frequency of filter events would not capture properly (e.g. when approaching coastal waters or entering a lagoon).
When [fdom] data were not available, the filtered absorption and attenuation were linearly interpolated between filter events before being removed from the total absorption and attenuation. From November 13, 2016 to May 6, 2017, the [BB3] was located upstream of the switch system and thus measured total (non-filtered) water all the time. During this period, the volume scattering coefficient of seawater was removed from the raw data counts to obtain the particulate backscattering coefficient [bbp]. The biofouling and instrument drift were estimated by comparing values before and after each cleaning event. The biofouling was estimated between two cleaning events by fitting an exponential or linear model to the raw data before removing it from the signal. We advise using this period with caution as the data were corrected with theoretical assumptions (i.e. pure seawater scattering and linear or exponential biofouling) that may differ from reality. From May 7, 2017 to the end of the expedition, the [BB3] was located downstream of the filter-switch system so that, like for the [ , and [WSCD] data were processed following the latest recommendations for processing inline 24 , using custom software available at https://github.com/OceanOptics/InLineAnalysis. The entire time series of measurements was automatically quality controlled (QC) to remove artifacts, then manually checked and QC'd for obviously inaccurate measurements due to saturated sensors, low flow rate, bubbles, or poor filtered seawater measurements. The full processing and QC procedure and reports can be accessed together with each dataset.
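The linear case described above (no [fdom] available) can be sketched as a baseline interpolation followed by a subtraction. This is a simplified illustration, not the InLineAnalysis implementation: times are plain numbers, a single wavelength is considered, and holding the first and last filtered values constant outside the filter events is an assumption.

```python
def interpolate_baseline(t, events):
    """Linearly interpolate the filtered-water value at time t.

    events: list of (time, filtered value) pairs sorted by time.
    Outside the first and last events, the nearest value is held constant (assumption).
    """
    if t <= events[0][0]:
        return events[0][1]
    if t >= events[-1][0]:
        return events[-1][1]
    for (t0, v0), (t1, v1) in zip(events, events[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def particulate_signal(total, events):
    """Subtract the interpolated filtered baseline from each total measurement."""
    return [(t, value - interpolate_baseline(t, events)) for t, value in total]

# Hypothetical filter events (minutes, absorption in 1/m) and total measurements.
filter_events = [(0, 0.02), (60, 0.04)]
total = [(15, 0.100), (30, 0.100), (45, 0.100)]
particulate = particulate_signal(total, filter_events)
```

When the dissolved signal changes quickly between two filter events, a straight line is a poor approximation, which is why the [fdom]-shaped interpolation was preferred when available.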

Sample measurements technical validation. For nutrient [NUT] samples, a quality check was done in several steps. First, a visual inspection was done to determine whether samples were overfilled or not frozen vertically, which may induce sample leakage during the freezing procedure. Second, any readings too close to detection limits, or duplicate measurements that differed by more than 10%, were flagged. In this last case, when the difference between two values of the same sample is greater than 10%, the high value is considered not acceptable and is not reported. Finally, the overall quality of the dataset was established by comparing measured values with Copernicus Marine Services modelling outputs.
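The duplicate rule above can be sketched as a small function. Several details are not specified in the text and are assumptions here: the relative difference is computed with respect to the lower value, "too close to the detection limit" is taken as "at or below a given limit", and duplicates that agree are reported as their mean.

```python
def qc_duplicates(a, b, detection_limit, max_rel_diff=0.10):
    """Return (reported value, flags) for a pair of duplicate measurements."""
    flags = []
    low, high = min(a, b), max(a, b)
    if low <= detection_limit:
        flags.append("near_detection_limit")
    if high == low:
        rel_diff = 0.0
    else:
        rel_diff = (high - low) / low if low > 0 else float("inf")
    if rel_diff > max_rel_diff:
        # Duplicates disagree: the high value is discarded and only the low one is reported.
        flags.append("duplicates_differ")
        return low, flags
    return (a + b) / 2, flags

# Hypothetical duplicate readings (same unit as the detection limit).
agree = qc_duplicates(1.00, 1.05, detection_limit=0.02)
disagree = qc_duplicates(1.00, 1.30, detection_limit=0.02)
```

Keeping the flags alongside the reported value lets downstream users decide whether to exclude flagged samples.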
For trace metals ([MTE-USC]), any samples in which concentrations were close to detection limits were flagged. A standard produced by the GEOTRACES program (coastal surface seawater standard) was included in each sample run. If the metal concentrations of the standard were outside the GEOTRACES community consensus values, the sample run was rejected. Trace metal concentrations had an average error of 5%.
[HPLC] samples were analysed as described in Ras et al. 2008. All pigment peaks were inspected and quality controlled as good, acceptable or qualitative. Any measurements below detection limits were disregarded.
[FCM] samples were analysed with a FACS Canto II Flow Cytometer equipped with a 488 nm laser 67 , and every measurement in which cell populations were complicated to identify, needed manual curation, or were impossible to identify was flagged.

Nets collection validity.
To estimate the technical validity of the different net collections, we analysed the raw abundance of living organisms collected by the [HSN-NET-300] and the [MANTA-NET-300] at the same stations, but sequentially in time. Indeed, the [MANTA-NET-300] is operated at a lower speed (3 knots maximum) than the [HSN-NET-300] (9 knots maximum), so the two nets could not be deployed simultaneously. Manta nets are commonly used and recognized as a reference type of net for investigating surface plankton 68-70 , and we therefore used a set of 24 stations where both nets were deployed to estimate the efficiency of the [HSN-NET-300]. For this, [F300] samples collected by both nets were imaged using the ZooScan 71 to obtain images of each object collected. Images were then transferred to EcoTaxa 64 and sorted taxonomically to the deepest taxonomic level possible. All results were used to calculate the normalized biovolume size spectra 72 (NBSS) of living organisms for both nets, which is analogous to abundance per size category. The NBSS makes it possible to investigate potential under- or over-sampling across the various sizes of organisms. The NBSS of both nets gave about the same order of magnitude of abundance (Fig. 4A), and when inspected along the size spectra between pairs of observations (Fig. 4B) they did not differ largely from 1:1 in 13 cases over the different deployments. A large variability between nets could however be observed at a few stations, which could possibly be caused by local plankton patchiness 73 resulting in more variability for [HSN-NET-300] and less for [MANTA-NET-300] due to its larger sampling volume. Overall, we can conclude that [HSN-NET-300] and [MANTA-NET-300] collect plankton with a relatively similar efficiency, even if the larger sampling volume of [MANTA-NET-300] allows a better collection of larger, rare organisms, as seen from spectra extending to larger sizes (Fig. 4A).
Nevertheless, these results show that the [HSN-NET-300] can be useful for underway zooplankton sampling in situations where it is not possible to stop the ship for regular sampling, or on ships of opportunity.
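For readers unfamiliar with the NBSS, the sketch below shows one common way of computing it: individual biovolumes are summed within size classes, divided by the volume of water sampled, and normalized by the width of each class. The octave (factor of two) classes, the units and the function name are assumptions chosen for illustration and are not necessarily those used for Fig. 4.

```python
import math
from collections import defaultdict

def nbss(biovolumes_mm3, sampled_volume_m3):
    """Normalized biovolume size spectrum with octave (factor 2) biovolume classes.

    Returns {(lower bound, upper bound): total biovolume / sampled volume / class width}.
    """
    totals = defaultdict(float)
    for v in biovolumes_mm3:
        totals[math.floor(math.log2(v))] += v  # class index k covers [2**k, 2**(k+1))
    spectrum = {}
    for k in sorted(totals):
        lower, upper = 2.0 ** k, 2.0 ** (k + 1)
        spectrum[(lower, upper)] = totals[k] / sampled_volume_m3 / (upper - lower)
    return spectrum

# Hypothetical individual biovolumes (mm3) from one net tow filtering 10 m3.
spectrum = nbss([1.0, 1.5, 3.0], sampled_volume_m3=10.0)
```

Computing the spectrum separately for each net at the same station and taking the ratio per size class gives the kind of comparison to the 1:1 line discussed above.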
Overall biogeochemical data validity. To assess the overall quality and homogeneity of the collected environmental parameters, we conducted a quick multivariate exploration of the dataset to compare it with the known biogeography of biogeochemical provinces 74,75 and their associated biogeochemical signatures. For this, we first used data simplified at the site level (see section 4 of Data records), selected only datasets providing a full overview over the geographical range of the expedition, applied a Box-Cox transformation, and centred and reduced each variable so that all were considered equally. This dataset was then analysed through a PCA (Fig. 5). The first 3 components of the PCA were used as an RGB (red, green, blue) colour code for each station, to better visualize the biogeochemical signature of the stations on a map. Finally, these were compared with known biogeochemical provinces extracted from 75 . Despite the different temporal resolution between instantaneous sampling and biogeochemical provinces representing a consensus over several years and seasons, the main biogeochemical provinces (and associated macroscale oceanic features) as well as their progressive boundaries are well captured by our sampling scheme. Among the notable features, the Pacific coast of the Americas is marked by a strong upwelling signature (with high amounts of nutrients and trace metals) and the southern Pacific gyre by a high salinity but low iron and silicate concentrations. The central Pacific zone is characterized by high temperature, light and sea surface height, small phytoplankton size (high gamma), and low chlorophyll a, NO3 and trace metal (Ni, Cu, Zn, Pb or Cu) concentrations, with the exception of the few stations centred on the equator, which clearly display indicators of local upwelling such as those potentially created by the equatorial upwelling.
This first overview clearly shows correspondence with known features related to nutrients and nutrient limitation of plankton, trace metals or even global biogeochemistry 76-78 , and further shows that the sampling scheme allowed corals and plankton to be sampled across a large variety of environmental constraints, whether oceanographic, climatic or chemical. The same analysis repeated using only sites sampled around islands further confirms this large variety of environmental constraints (Fig. 6). To evaluate the variety of past temperature histories, and notably the impact of past seasonality and heat/cold waves, we further reproduced this analysis using historical temperature and heat/cold waves experienced at coral sites. However, since temperature anomalies and their accumulated degree cooling weeks (DCW) could be negative, only a basic normalization of the data was applied, since the Box-Cox transformation is not suited to negative values. The first axis of the PCA separates islands that suffered intense and recurrent heat-waves (positive values) from those that rather experienced cold-waves (negative values), while the second axis separates cold and highly seasonal islands (positive values) from islands with warm environments and low seasonality (negative values). This analysis further confirms that the selected locations also display a full variety of past temperature and heat-wave histories, and also reflects known geographical patterns of bleaching events 58,79 .
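The RGB colour coding of stations from their first three principal component scores can be sketched as a simple rescaling. The min-max scaling to the 0-255 range used below is an assumption (the exact scaling used for Fig. 5 is not described here), and the PCA itself is assumed to have been computed beforehand.

```python
def scores_to_rgb(scores):
    """Rescale the first three score columns to 0-255 and return one (R, G, B) tuple per station."""
    columns = list(zip(*scores))[:3]
    scaled = []
    for column in columns:
        low, high = min(column), max(column)
        span = (high - low) or 1.0  # avoid division by zero for a constant column
        scaled.append([round(255 * (x - low) / span) for x in column])
    return list(zip(*scaled))

# Hypothetical PC1-PC3 scores for three stations.
scores = [(0.0, -1.0, 3.0), (1.0, 0.0, 3.0), (5.0, 4.0, 3.0)]
colours = scores_to_rgb(scores)
```

Stations with similar biogeochemical signatures then receive similar colours, so that gradients and boundaries between provinces become visible on a map.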

Usage Notes
We recommend paying close attention to the various quality flags provided with the raw datasets to avoid using lower quality data if needed. Similarly, to provide the most complete set of observations for each sample, we provided the lag in time (dt), as well as the distance in horizontal (dxy) and vertical (dz) space, between the collection timing, latitude/longitude and depth/altitude of the sample and that of the environmental context provided. Depending on the scientific question, future users are encouraged to carefully define reasonable time lags and distances to consider in their study, to avoid including unrealistic associations between samples. Moreover, we extracted contextual data at the event level to simplify the data extraction task. We also provide a simplified version at the site level by combining and cross-calibrating all similar variables (e.g. using different sources of SST data to fill gaps of missing data and obtain one merged SST variable). We prioritised observations originating from in-situ samples over satellite data, and over modelled data (MERCATOR), and evaluated their correspondence by linear regressions. Potential biases of satellite and modelled data in comparison to in-situ data were corrected by applying the slope and intercept of their linear regression to force satellite and modelled data to best match in-situ data. Similarly, we also chose to interpolate some environmental variables that were sampled only a few hours before or after the site itself to maximize data recovery for each sampling station. We acknowledge that merging different sources of data can introduce differences in variance depending on the source of data used; therefore, we encourage users to cautiously evaluate the relevance of this merged dataset for their study.
Considering the intrinsic heterogeneity of variance between the different datasets, and their potential non-normal distribution, we recommend using appropriate normalisation methods before any multivariate statistical analysis. Here we chose to apply a Box-Cox transformation and to centre and reduce each variable.
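As a minimal sketch of this normalisation, the functions below apply a Box-Cox transformation with a user-supplied exponent and then centre and reduce the result. In practice the exponent is usually estimated from the data (for instance by maximum likelihood); here it is passed explicitly, which is an assumption made to keep the example self-contained.

```python
import math
from statistics import mean, stdev

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values; lam = 0 gives the natural logarithm."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def centre_reduce(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical skewed variable (e.g. a nutrient concentration across sites).
raw = [0.05, 0.1, 0.2, 0.8, 3.2]
normalised = centre_reduce(box_cox(raw, lam=0))
```

Variables that can take negative values, such as temperature anomalies, cannot be Box-Cox transformed and can only be centred and reduced.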
In this version of the dataset, the satellite data used are 8-day averages, while the in-situ measurements are instantaneous measurements of optical properties averaged over the station sampling period. The 8-day averaging tends to attenuate extreme values and reduces the potential differences between stations. While suited for macro-ecological processes that depend on large temporal and spatial variations of their environment, 8-day average satellite products could be inaccurate for studying the shorter life cycles of pico-, nano- and micro-plankton.
Moreover, phytoplankton can adjust their light-harvesting pigment concentrations according to light exposure, nutrient availability and temperature. These variations are negligible over periods shorter than a day but can become significant over 3-5 days 80 and references therein. Therefore, we advise users to use the merged bio-optical variables of this dataset cautiously, to verify their compatibility with the research question, and potentially to replace this 8-day average with shorter time observations if available. As presented in section "3.3. Continuous measurements", the [poc] was estimated from the underway system, using both the measured [cp] 28 and [bbp] 29 . The [BB3] sensor has a low signal-to-noise ratio due to its high sensitivity to bubbles in the water line and to accumulation of particles in the sensor, therefore, the [